Announcing OpsHelm EventStream
Cloud environments are constantly changing. With all of the automated systems, IaC deployments, manual actions, system failures, and ephemeral resources coming and going, even a small environment can be a challenge to monitor. Of course, the flexibility that enables this chaos is a feature, not a bug. One of the primary reasons many organizations have invested in the cloud is the ability to rapidly scale their infrastructure up and down to meet fluctuating demands.
However, this presents some challenges. In most organizations, it’s common for many individuals and systems to have some ability to make changes to a given cloud environment. This makes it very difficult to effectively track what resources are changing, how they’re changing, and who is responsible for those changes - even for organizations that have invested heavily in infrastructure as code (IaC) and built robust change control processes.
Within a given cloud provider, there is often a lot of inconsistency with regards to the format and content of log messages and events between different services. This issue is further exacerbated if you use multiple providers. In many cases a single event may not contain all of the relevant information about a change, so correlating multiple events from multiple sources is required in order to understand the complete context.
The Challenge
Although it’s technically possible to manually sift through cloud audit logs and piece things back together, there are many problems with this approach:
- This does not allow for any programmatic detection or response to unexpected or otherwise problematic changes.
- Human analysis of logs neither scales nor is cost-effective compared to automated processing.
- The times you most need this information are often incidents: a security issue, an outage, or another time-sensitive problem. That is not the moment to be manually sourcing and analyzing logs.
Introducing EventStream
OpsHelm already consumes the appropriate logs and events from your cloud provider, enriches and normalizes them, and maintains an asset inventory for your cloud environment. Alongside this, we are launching EventStream.
EventStream is, as the name suggests, a stream of events from your cloud environment, enriched with additional data, normalized to facilitate programmatic processing, and delivered as a single stream that you can consume. We also include the raw event from your cloud provider in each message so that it can easily be integrated into any existing tooling that you have.
Events are delivered over the SSE (server-sent events) protocol, which can be easily consumed using one of the many libraries available for subscribing to an SSE stream, with each event contained within an easily parsable JSON object.
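If you'd like a feel for what that looks like in practice, here is a minimal sketch in Python that subscribes to the stream with the requests library and parses the data: lines of each SSE frame into JSON. The endpoint URL, authorization header, and the type field shown are placeholders for illustration rather than the real API surface.

```python
# Minimal sketch: subscribe to the EventStream SSE feed and parse each event.
# The URL and auth header are hypothetical placeholders, not the real API.
import json
import requests

STREAM_URL = "https://eventstream.example.invalid/stream"   # placeholder
HEADERS = {"Authorization": "Bearer <your-api-token>"}       # placeholder

with requests.get(STREAM_URL, headers=HEADERS, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # SSE frames carry their JSON payload on lines prefixed with "data:".
        # (This simple parser assumes single-line data fields.)
        if line and line.startswith("data:"):
            event = json.loads(line[len("data:"):].strip())
            print(event.get("type"), event)
```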
Normalization
Some example normalizations:
- Each event has a consistent type, either com.opshelm.asset.create, com.opshelm.asset.update, or com.opshelm.asset.delete, making classification and routing of each event type simple (see the sketch after this list). For comparison, here are just a few of the event types within AWS for a simple change event: UpdateRole, ModifySecurityGroupRules, AddPermission, AssignPrivateNatGatewayAddress, DeleteBucketPolicy, EnableKeyRotation, RemovePermission, SetEndpointAttributes, and SubmitTaskStateChange.
- Each event contains an OpsHelm name for the asset (as well as the provider’s own name for the asset) which, unlike the cloud provider’s naming, remains consistent and so can be used to uniquely and consistently refer to the same asset.
- Each event contains a normalized asset type, to facilitate easy classification of groups of like assets.
- Each event is timestamped using the same timestamp format and timezone, reducing confusion or misunderstanding when constructing a timeline of events.
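As a sketch of how much the consistent event types simplify things, the snippet below routes each event to a handler based solely on its normalized type. The asset_name field name is an assumption used for illustration.

```python
# Sketch: route events to handlers based on their normalized OpsHelm type.
# The "asset_name" field name is assumed for illustration.
def handle_create(event: dict) -> None:
    print("created:", event.get("asset_name"))

def handle_update(event: dict) -> None:
    print("updated:", event.get("asset_name"))

def handle_delete(event: dict) -> None:
    print("deleted:", event.get("asset_name"))

HANDLERS = {
    "com.opshelm.asset.create": handle_create,
    "com.opshelm.asset.update": handle_update,
    "com.opshelm.asset.delete": handle_delete,
}

def route(event: dict) -> None:
    handler = HANDLERS.get(event.get("type"))
    if handler is not None:
        handler(event)
```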
Enrichment
Some example event enrichment:
- Events are enriched with attribution for the change. This includes the IP address, User Agent, ID, and Assumed ID of the account making the change, as well as a determination of whether that change was made in a browser, by CLI, by Terraform, by an SDK, or as part of the cloud provider’s internal operations (e.g. scaling up and down).
- Events are enriched with asset associations. For example, if an event pertains to a compute instance, it could include references to the associated storage, network subnets, etc.
- com.opshelm.asset.update and com.opshelm.asset.delete events contain not only the new configuration of the asset as given by your cloud provider, but are also enriched, using the OpsHelm asset inventory, with the previous configuration of the asset. This makes it very easy to quickly determine exactly what the change was (a diff sketch follows this list).
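To show how the previous configuration enrichment makes a change obvious, here is a sketch that produces a key-level diff of an asset's configuration before and after an event. The previous_configuration and configuration field names are assumptions for illustration.

```python
# Sketch: produce a simple key-level diff between the previous and new
# configuration of an asset. Field names are assumed for illustration.
def config_diff(event: dict) -> dict:
    old = event.get("previous_configuration") or {}
    new = event.get("configuration") or {}
    changed = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changed[key] = {"before": old.get(key), "after": new.get(key)}
    return changed
```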
Use cases
Of course, the challenges and goals of each organization will be different, which is one of the reasons we created EventStream. Still, I would like to take a moment to mention a few potential use cases…
Detect compromised internal accounts
A common attack vector is to gain access by obtaining the credentials of someone already authorized to use the cloud environment.
A number of checks that identify these attacks can be implemented very easily using EventStream:
- Using the IP address portion of the attribution enrichment, maintain a history of IP addresses used to access the environment and alert on a new source of change, an IP from an unexpected geography, or an impossible journey (see the sketch after this list).
- If your organization always uses a specific way to update your cloud configuration, for example all changes are made using Terraform, then using the attribution enrichment to identify changes made via a browser may be effective.
- Using the attribution enrichment, identify changes being made by a principal or assumed role that is not expected to make changes.
- Using the attribution enrichment, identify the principal ID that made a change and automate checks in other services, for example using Have I Been Pwned to ensure that the account has not been identified in any recent breaches.
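As a sketch of the first of these checks, the snippet below keeps a running set of source IP addresses seen in the attribution enrichment and alerts the first time a change arrives from a new one. The attribution and source_ip field names are assumptions for illustration.

```python
# Sketch: alert the first time a change arrives from an IP address we have
# not seen before. The "attribution" and "source_ip" field names are assumed.
seen_ips: set[str] = set()

def check_source_ip(event: dict) -> None:
    ip = (event.get("attribution") or {}).get("source_ip")
    if ip and ip not in seen_ips:
        seen_ips.add(ip)
        print(f"ALERT: change made from previously unseen IP {ip}")
```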
Incident response
During an incident there is likely to be a lot of change, often a mixture of remediation activities and the attacker’s attempts to circumvent them. In these situations time is of the essence, so manually gathering and correlating information from a number of sources in order to track what is happening is most likely not a viable option.
- The normalized event types and asset names mean that creating a real-time list of which assets have been created, changed, or deleted is simple (see the sketch after this list).
- By using the previous configuration enrichment, this could be expanded to show a configuration diff, providing a way to rapidly understand the changes taking place.
- In the case of a delete event, the previous configuration enrichment will show you what has just been deleted. Generally, when an asset is deleted, a cloud provider will also delete its configuration, making that information unavailable to you.
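Here is a sketch of that real-time list: each incoming event is grouped by asset name with its action and timestamp, ready to be summarized at any point during the incident. The asset_name and time field names are assumptions for illustration.

```python
# Sketch: keep a live record of which assets were created, updated, or
# deleted during an incident. The "asset_name" and "time" field names are
# assumed for illustration.
from collections import defaultdict

incident_log: dict[str, list[tuple[str, str]]] = defaultdict(list)

def track(event: dict) -> None:
    action = event.get("type", "").rsplit(".", 1)[-1]   # create / update / delete
    incident_log[event.get("asset_name", "unknown")].append(
        (event.get("time", ""), action)
    )

def summarize() -> None:
    for asset, actions in sorted(incident_log.items()):
        print(asset, "->", ", ".join(f"{t} {a}" for t, a in actions))
```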
Anomalies
Anomalies, that is, something occurring in your cloud environment which appears in some way different to “normal” changes, can be a way to detect malicious activities without having to determine a specific pattern to look for.
The compromised internal accounts use case above already highlights some examples of anomaly detection, but there are other anomalies that would be relatively easy to detect using EventStream:
- Monitor the volume of changes by time of day and day of week. Using the normalized events, obtaining a timestamp and change type is very simple. If your organization tends to operate on weekdays, a flurry of changes on a Sunday afternoon would immediately stand out.
- Monitor for any change in unexpected regions. New assets being created in a different region from the rest of your infrastructure can at best be a costly typo, due to having to transit data between that and your intended region, but at worst can be the sign of a compromise: adversaries have been known to create assets in regions other than those their victims operate in, in order to avoid being detected by monitoring systems. If you have ever tried to find an asset in a cloud provider’s console without knowing which region it is in, I am sure you know this pain. (A sketch covering both of these checks follows this list.)
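A sketch covering both of these checks might look like the following, flagging changes made outside weekday working hours or in regions you do not expect to use. The time and region field names, and the example regions, are assumptions for illustration.

```python
# Sketch: flag off-hours activity and changes in unexpected regions. The
# "time" and "region" fields, and the expected regions, are assumed values.
from datetime import datetime

EXPECTED_REGIONS = {"us-east-1", "us-west-2"}   # example values only

def flag_anomalies(event: dict) -> list[str]:
    flags = []
    ts_raw = event.get("time", "")
    try:
        ts = datetime.fromisoformat(ts_raw.replace("Z", "+00:00"))
    except ValueError:
        return flags
    if ts.weekday() >= 5 or not (8 <= ts.hour < 18):
        flags.append("change outside weekday working hours")
    region = event.get("region")
    if region and region not in EXPECTED_REGIONS:
        flags.append(f"change in unexpected region {region}")
    return flags
```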
Metrics, dashboards, and analysis
We don’t want to provide your “18th single pane of glass” for understanding your environment; that is not going to streamline anything. However, it is entirely likely that there is already a single pane of glass that your SRE, Operations, or Security functions use.
EventStream can be used to provide additional metrics via both normalization to correctly categorize and group events, and enrichment to provide additional data.
For example, generating graphs or reports on the rate of change over time, the assets which change most or least frequently, the users making the most changes, the busiest time of day for change, and so on can support capacity planning, cost management, or operational excellence.
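As a sketch of feeding such a dashboard, the snippet below aggregates events into simple counters (changes per asset type, per user, and per hour) that could be exported to whatever metrics system you already use. The asset_type, attribution, and time field names are assumptions for illustration.

```python
# Sketch: aggregate events into simple counters suitable for a dashboard.
# The "asset_type", "attribution", and "time" field names are assumed.
from collections import Counter

changes_by_asset_type: Counter = Counter()
changes_by_user: Counter = Counter()
changes_by_hour: Counter = Counter()

def record_metrics(event: dict) -> None:
    changes_by_asset_type[event.get("asset_type", "unknown")] += 1
    user = (event.get("attribution") or {}).get("principal_id", "unknown")
    changes_by_user[user] += 1
    changes_by_hour[event.get("time", "")[:13]] += 1   # bucket by ISO hour
```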
Configuration drift & periodic scan enhancement
Detecting configuration drift, or any other kind of issue for that matter, via a periodic check can miss important detail. An unexpected configuration change that is made, used as part of some activity, and then undone between two scheduled checks will most likely not appear in any of the data gathered by those checks. EventStream delivers all of the pertinent events, not point-in-time snapshots, enabling you to reconstruct the sequence of events.
For example:
- Periodic check / scan 1 completes
- A user is created in the IAM panel and granted access to a database
- Firewall rules are modified to expose the database to an attacker’s IP address
- Attacker uses the new user to access the database via the firewall rule and copies the database
- The firewall is returned to its previous state
- The user is removed from IAM
- Periodic check / scan 2 starts
In this example it is likely that the periodic check / scan will not notice anything, and even if it does, it is unlikely to provide the details needed for an investigation. EventStream’s previous configuration enrichment means that you would not only have the events pertaining to the change, but would know exactly what changed: in this case, the username which was added and removed, and the IP address which was added to and removed from the firewall rules. The attribution enrichment also means that you would immediately know which user account performed these actions, from which IP address, using which configuration method. We feel that this information is much more pertinent to an investigation into such an issue.
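As a sketch of reconstructing that window, the snippet below filters collected events to those between two scan timestamps and prints who changed what and when; because the timestamps are normalized, a plain string comparison is enough to order them. The field names other than the event type, and the example timestamps, are assumptions for illustration.

```python
# Sketch: reconstruct what happened between two periodic scans from events
# collected off the stream. Field names other than "type", and the example
# timestamps, are assumed for illustration.
all_events: list[dict] = []   # e.g. accumulated by the SSE consumer above

def events_between(events: list[dict], start: str, end: str) -> list[dict]:
    # Normalized ISO-8601 timestamps in one timezone sort correctly as strings.
    return sorted(
        (e for e in events if start < e.get("time", "") < end),
        key=lambda e: e.get("time", ""),
    )

for e in events_between(all_events, "2024-01-07T02:00:00Z", "2024-01-07T03:00:00Z"):
    who = (e.get("attribution") or {}).get("principal_id", "unknown")
    print(e.get("time"), e.get("type"), e.get("asset_name"), "by", who)
```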
Route to, and integrate with, your existing tools
Of course, many organizations will have an existing suite of tools that they do not wish to part ways with. EventStream makes it easy to route events to, and integrate with, those tools:
- The normalized event types make it very easy to route events to different tools depending on the asset type, or on whether the event is a create, update, or delete.
- We include the raw message from your cloud provider in each event so that it can be easily parsed by existing tools that already understand that format (see the sketch after this list).
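As a sketch of that second point, the snippet below pulls the embedded raw provider message out of an event and forwards it to an existing tool over a webhook. The raw_event field name and the webhook URL are assumed placeholders for illustration.

```python
# Sketch: forward the embedded raw cloud-provider message to an existing
# tool that already understands that format. The "raw_event" field name and
# the webhook URL are assumed placeholders.
import requests

WEBHOOK_URL = "https://siem.example.invalid/ingest"   # placeholder

def forward_raw(event: dict) -> None:
    raw = event.get("raw_event")
    if raw is not None:
        requests.post(WEBHOOK_URL, json=raw, timeout=10)
```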
In Conclusion…
EventStream brings the log consistency that we always wanted from our cloud providers, enabling you to create your own tooling more rapidly and in a more robust fashion. It also brings enrichments which provide rich context and insight into events, so that you can build more intelligent and actionable tooling and responses.